Using the Distributional Hypothesis to Derive Cooccurrence Scores from the British National Corpus
Abstract
In this paper I examine a number of cooccurrence-based scoring systems that use the British National Corpus to measure word association over wide contexts. The principal aim is to address the question of how to evaluate a given scoring system, or how to compare two scoring systems, without relying on a small list of example pairs and a ‘feel’ for the results. I evaluate these systems using i) a list of noun-noun pairs and ii) a simple test on aligned and misaligned sets of nouns. I also consider why noun-noun pairs are deemed appropriate for such mechanisms and explore the prospects for determining which words or lemmas will be appropriate for a distributional scoring approach. For my specific application, an algorithm similar to MI-score operating across multiple windows offers the best results.
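As a rough illustration of what an MI-style score computed over several window sizes might look like, the sketch below uses the common corpus-linguistics form of MI-score, log2(observed / expected), where the expected pair count is f(x) * f(y) * span / N and span is twice the window size. This is not the paper's algorithm, which the abstract does not specify; the function names, the toy token list, and the choice to report one score per window are all assumptions made for illustration.

```python
import math
from collections import Counter


def window_pair_counts(tokens, window):
    """Count unordered pairs of distinct tokens lying within `window` positions of each other."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        # Look only forward so that each pair of positions is counted once.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                pairs[frozenset((left, right))] += 1
    return pairs


def mi_score(tokens, x, y, window):
    """MI-style score: log2 of the observed pair count over the count expected by chance.

    Assumption: expected = f(x) * f(y) * span / N with span = 2 * window,
    the usual approximation for a symmetric window.
    """
    n = len(tokens)
    freq = Counter(tokens)
    observed = window_pair_counts(tokens, window)[frozenset((x, y))]
    if observed == 0:
        return float("-inf")  # the pair never cooccurs at this window size
    expected = freq[x] * freq[y] * (2 * window) / n
    return math.log2(observed / expected)


def multi_window_scores(tokens, x, y, windows):
    """Return one MI-style score per window size (how to combine them is left open)."""
    return {w: mi_score(tokens, x, y, w) for w in windows}


# Hypothetical toy input; a real run would use lemmatised corpus text.
toy_tokens = ["doctor", "nurse", "a", "b", "doctor", "nurse", "c", "d"]
toy_scores = multi_window_scores(toy_tokens, "doctor", "nurse", [1, 2])
```

Reporting a score per window, rather than a single combined number, keeps the sketch neutral about how scores from different windows should be weighted.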
Similar resources
Testing the Distributional Hypothesis: The Influence of Context on Judgments of Semantic Similarity
Distributional information has recently been implicated as playing an important role in several aspects of language ability. Learning the meaning of a word is thought to be dependent, at least in part, on exposure to the word in its linguistic contexts of use. In two experiments, we manipulated subjects’ contextual experience with marginally familiar and nonce words. Results showed that similar...
A Corpus-Based Study of the Lexical Make-up of Applied Linguistics Article Abstracts
This paper reports results from a corpus-based study that explored the frequency of words in the abstracts of applied linguistics journal articles. The abstracts of major articles in leading applied linguistics journals, published since 2005 up to November 2001, were analyzed using software modules from the Compleat Lexical Tutor. The output includes a list of the most frequent content words, list...
Identifying semantic relations in a specialized corpus through distributional analysis of a cooccurrence tensor
We describe a method of encoding cooccurrence information in a three-way tensor from which HAL-style word space models can be derived. We use these models to identify semantic relations in a specialized corpus. Results suggest that the tensor-based methods we propose are more robust than the basic HAL model in some
Situation and Text: Representation of Migrants Whilst the Escalation of Refugee Crisis in Great Britain as Compared to Russia
Increasing migration is a vital concern for a globalizing sociocultural environment in today’s world. The UK and developed European countries have become an attractive destination for asylum seekers (labelled as “migrants”) in the past decade. The rapid rise in the number of asylum seekers, which was labelled “migration crisis” (Ruz, 2015), made this topic an integral part of scientific discuss...
Unsupervised induction of stochastic context-free grammars using distributional clustering
An algorithm is presented for learning a phrase-structure grammar from tagged text. It clusters sequences of tags together based on local distributional information, and selects clusters that satisfy a novel mutual information criterion. This criterion is shown to be related to the entropy of a random variable associated with the tree structures, and it is demonstrated that it selects linguisti...